D1.1.a: Univariate analysis – data types and description of the independent attributes, which should include: name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions / tails, missing values, outliers.
D1.1.b: Strategies to address the different data challenges, such as data pollution, outlier treatment and missing-value treatment.
D1.1.c: Please provide comments in the Jupyter notebook regarding the steps you take and the insights drawn from the plots.
D1.2.a: Bi-variate analysis between the predictor variables and target column. Comment on your findings in terms of their relationship and degree of relation if any. Visualize the analysis using boxplots and pair plots, histograms or density curves. Select the most appropriate attributes.
D1.2.b: Please provide comments in the Jupyter notebook regarding the steps you take and the insights drawn from the plots.
D2.1: Ensure the attribute types are correct; if not, take appropriate actions. D2.2: Get the data model ready. D2.3: Transform the data, i.e. scale / normalize if required. D2.4: Create the training set and test set in a 70:30 ratio.
First create models using the Logistic Regression and Decision Tree algorithms. Note the model performance using different metrics. Use a confusion matrix to evaluate class-level metrics, i.e. Precision/Recall. Also report the accuracy and F1 score of each model.
Build the ensemble models: Bagging and Boosting (at least 3 algorithms). Note the model performance using the same metrics as for the models above.
Make a DataFrame to compare the models and their metrics. Conclude which algorithm is best, and give the reasoning behind your choice.
Bank client data:
Other attributes:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import pandas_profiling
from sklearn.impute import SimpleImputer
import seaborn as sns # Import data visualization library for statistical graphics
import matplotlib.pyplot as plt # Import data visualization library
from sklearn import metrics # For Linear, Logistic Regressions, Decision Tree
from sklearn.model_selection import train_test_split # For LinR, LogR, DTree
# ====== For Linear Regression ======
from scipy.stats import zscore, pearsonr
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import statsmodels.api as sm
from yellowbrick.regressor import ResidualsPlot
from yellowbrick.classifier import ClassificationReport, ROCAUC
# ====== For Logistic Regression ======
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, precision_score, accuracy_score
from sklearn.metrics import f1_score, roc_curve, roc_auc_score, classification_report
# ====== For Decision Tree ======
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from io import StringIO # sklearn.externals.six was removed in scikit-learn 0.23+
from IPython.display import Image, Markdown
import pydotplus as pdot # to display decision tree inline within the notebook
import graphviz as gviz
# DecisionTree does not take strings as input for the model-fit step, so text features must be encoded first
from sklearn.feature_extraction.text import CountVectorizer
# ======= For Ensemble Techniques =======
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
# ======= Set default style ========
# Multiple output displays per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>")) # Increase cell width
# Remove scientific notations to display numbers with 2 decimals
pd.options.display.float_format = '{:,.2f}'.format
plt.figure(figsize=(12,8))
sns.set_style(style='darkgrid')
%matplotlib inline
# ===== Options =====
import pickle # For model export
from os import system # For system (eg MacOS, etc) commands from within python
# Increase max number of rows and columns to display in pandas tables
pd.set_option('display.max_columns', 100) # Max df cols to display set to 100.
pd.set_option('display.max_rows', 50) # Max df rows to display set to 50.
# pd.set_option('display.max_rows', tdf.shape[0]+1) # just one row more than the total rows in df
# Update default style and size of charts
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = [10, 8]
vvv D1.1.a Univariate analysis: Data types and description of the independent attributes, which should include: name, meaning, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions / tails, missing values, outliers. Starts Below: vvv
# Read & Load the input Datafile into dataset frame:
tdf = pd.read_csv('bank_full.csv')
# Display the df rows from head, tail and random sample rows:
tdf
tdf.sample(7)
# Incremental DF Backup 0 as of now:
tdf0 = tdf.copy() # Original Df
# Verify backup copy
tdf0.shape, type(tdf0)
tdf0.sample(7)
profile = pandas_profiling.ProfileReport(tdf)
profile
| # | Old Name | New Name | Column Name Meaning & Data Description |
|---|---|---|---|
| 01 | age | age | Age in years |
| 02 | job | job | Profession / Occupation Name/Title (descriptive) |
| 03 | marital | marital | Marital Status (Single, Married, etc.) |
| 04 | education | edu | Education Level (Primary, Secondary, etc.) |
| 05 | default | dflt | Has the customer defaulted (failed to pay) on an amount due on a loan/payment/etc.? |
| 06 | balance | bbal | Bank Account Balance in monetary numeric value |
| 07 | housing | hloan | Does the customer have a Housing Loan (Yes, No, etc.)? |
| 08 | loan | ploan | Does the customer have a Personal Loan (Yes, No, etc.)? |
| 09 | contact | comtyp | Type of communication contact with the Customer (cell, phone, etc.) |
| 10 | day | mntday | Day of the month when the Customer was contacted for sales/marketing campaign |
| 11 | month | month | Month in which the Customer was contacted for sales/marketing campaign |
| 12 | duration | talktm | Current Campaign: Sales/Marketing call time duration in seconds (talk time) |
| 13 | campaign | talknm | Current Campaign: Number of times the Customer was contacted (talked to) for Sales/Marketing |
| 14 | pdays | pdays | Previous Campaign: Number of days passed since the last contact with the Customer |
| 15 | previous | ptalknm | Previous Campaign: Number of times the Customer was contacted (talked to) for Sales/Marketing |
| 16 | poutcome | presult | Previous Campaign: Did Customer accept Term Deposit? (Yes, No, etc) |
| 17 | Target | Target | Current Campaign: Did Customer accept Term Deposit? (Yes, No, etc) |
# Rename column names for convenience, meaningfulness and intuitiveness:
tdf.rename(columns={'education': 'edu', 'default': 'dflt', 'balance': 'bbal', 'housing': 'hloan', 'loan': 'ploan',
'contact': 'comtyp', 'day': 'mntday', 'duration': 'talktm', 'campaign': 'talknm', 'previous': 'ptalknm',
'poutcome': 'presult'}, inplace=True, errors='raise')
tdf
# Incremental DF Backup 1 as of now:
tdf1 = tdf.copy() # Modified Df: Renamed Col. names
# Verify backup copy
tdf1.shape, type(tdf1)
tdf1.sample(7)
Markdown('### * DF Shape Number of (Rows, Columns) & DF Type:')
tdf.shape, type(tdf)
Markdown('### * DF Info with : Column Names & Data Types')
tdf.info()
Markdown('### * DF Stats for Continuous/Numeric value columns: Range (Min & Max), Central values (Mean), Std.D, Quartiles')
tdf.describe()
Markdown('#### * DF Stats for Continuous/Numeric value columns: Central values: MEDIAN which are not in Standard "Describe/Stats" ^above:')
tdf.median()
Markdown('### * DF Stats for Categorical/Non Numeric columns:')
tdf.describe(exclude='number')
Markdown('### * DF Stats for All columns: Central values: MODE which are not in Standard "Describe/Stats" ^above:')
tdf.mode()
Markdown(""" ### * DF Number of Duplicate Rows Based on All 17 Columns: {dup} """.format(dup=tdf.duplicated().sum()))
Markdown('### * DF Number of Duplicate Rows Based on Certain Customer "Identifying Information" Columns:')
print('* Duplicate Rows for first 11 Columns: age, job, marital, edu, dflt, bbal, hloan, ploan, comtyp, mntday, month:',
tdf.duplicated(['age','job','marital','edu','dflt','bbal','hloan','ploan','comtyp','mntday','month']).sum())
print('* Duplicate Rows for first 10 Columns: age, job, marital, edu, dflt, bbal, hloan, ploan, comtyp, mntday:',
tdf.duplicated(['age','job','marital','edu','dflt','bbal','hloan','ploan','comtyp','mntday']).sum())
print('* Duplicate Rows for first 9 Columns: age, job, marital, edu, dflt, bbal, hloan, ploan, comtyp:',
tdf.duplicated(['age','job','marital','edu','dflt','bbal','hloan','ploan','comtyp']).sum())
print('* Duplicate Rows for first 8 Columns: age, job, marital, edu, dflt, bbal, hloan, ploan:',
tdf.duplicated(['age','job','marital','edu','dflt','bbal','hloan','ploan']).sum())
print('* Duplicate Rows for first 7 Columns: age, job, marital, edu, dflt, bbal, hloan:',
tdf.duplicated(['age','job','marital','edu','dflt','bbal','hloan']).sum())
print('* Duplicate Rows for first 6 Columns: age, job, marital, edu, dflt, bbal:',
tdf.duplicated(['age','job','marital','edu','dflt','bbal']).sum())
print('* Duplicate Rows for first 6 Columns & "Target": age, job, marital, edu, dflt, bbal & Target:',
tdf.duplicated(['age','job','marital','edu','dflt','bbal','Target']).sum())
print('* Duplicate Rows for first 8 Columns & "Target": age, job, marital, edu, dflt, bbal, hloan, ploan & Target:',
tdf.duplicated(['age','job','marital','edu','dflt','bbal','hloan','ploan','Target']).sum())
Markdown('### * DF Number of Duplicate Columns: 2 Columns = "pdays" and "ptalknm" (Highly Likely):'
""" Duplicate Rows for these two columns: {dup} = {pct:.1f}% = {dup} * 100 / {tot} = DupRows*100/TotalRows
""".format(dup=tdf.duplicated(['pdays','ptalknm']).sum(), tot=tdf.shape[0],
           pct=tdf.duplicated(['pdays','ptalknm']).sum() * 100 / tdf.shape[0]))
Markdown('### * DF Null values for All columns: None:')
tdf.isna().sum()
Markdown('### Unique Values for All columns:')
tdf.nunique()
Markdown('### * DF Numeric columns having Zero values:')
(tdf.select_dtypes(include='number') == 0).sum()
Markdown('### * DF Numeric columns having -ve values:')
(tdf.select_dtypes(include='number') < 0).sum()
Markdown('### * DF Values Counts for Categorical column "JOB":')
tdf.job.value_counts()
Markdown('### * DF Values Counts for Categorical column "MONTH":')
tdf.month.value_counts()
Markdown('### * DF Values Counts for the remaining 8 Categorical columns out of total 10 Cat. cols.:')
tdf[['marital','edu','dflt','hloan','ploan','comtyp','presult','Target']].apply(pd.value_counts)
Markdown('### * Class Imbalance% for the TARGET In Percentage: "Target" Column Value Counts (Normalized):')
tdf['Target'].value_counts(normalize=True)*100
Markdown('###### *** There IS SIGNIFICANT CLASS IMBALANCE in Target Column as per above^^^')
# Identify Outlier Values in All Non Categorical / Numeric Columns:
Markdown('### * DF Outliers for Non Categorical / Numeric Columns: $Low = Q1 - (IQR * 1.5)$; $High = Q3 + (IQR * 1.5)$')
for col in tdf.select_dtypes(include='number'):
q1 = tdf[col].quantile(.25)
q3 = tdf[col].quantile(.75)
otr = (q3 - q1) * 1.5
otl = q1 - otr
oth = q3 + otr
print('\nNumber Of Outliers values Under Low End (', otl, ') for column:', col, '=', (tdf[col] < otl).sum())
print( 'Number Of Outliers values Over High End (', oth, ') for column:', col, '=', (tdf[col] > oth).sum())
# Create col.name list for Cat and non.Cat cols for convenience
numcols = tdf.select_dtypes(include='number').columns
catcols = tdf.select_dtypes(exclude='number').columns
print('\n* Cat.Cols:', catcols)
print('\n* Non Cat.Cols:', numcols)
print('\n')
# Change columns datatype from 'object' to 'category'
tdf.info()
tdf[catcols] = tdf[catcols].astype('category')
tdf.info()
# Incremental DF Backup 2 as of now:
tdf2 = tdf.copy() # Modified Df: Cat cols (NonNum Cols, type='object') changed to type 'category'
# Verify backup copy
tdf2.shape, type(tdf2)
tdf2.sample(5)
tdf2.info()
# Histogram of all 7 NonCategorical columns: Visual Distribution of column values:
tdf[numcols].hist(stacked=False, bins=100, figsize=(20,30), layout=(9, 3));
# A closer look at the distributions of 'mntday' and 'talknm' (the latter on a log scale)
fig, axs = plt.subplots(ncols = 2, figsize = (30, 10))
sns.distplot(tdf.mntday, hist=True, ax = axs[0])
sns.distplot(np.log(tdf.talknm), ax = axs[1])
# Count Plot of all 10 Categorical columns: Visual value_counts of all columns of type 'category'
fig, axs = plt.subplots(ncols = len(catcols), figsize = (30, 5))
j=0
for i in tdf[catcols]:
sns.countplot(tdf[i], ax = axs[j], hue=tdf.Target)
j = j+1
# A closer look at values for cols 'job', 'month'
fig, axs = plt.subplots(ncols = 3, figsize = (30, 5))
j=0
for i in ['job', 'month']:
# plt.figure(figsize = (4,2))
sns.countplot(tdf[i], ax = axs[j], hue=tdf.Target)
j = j+1
sns.distplot(pd.to_datetime(tdf.month, format='%b').dt.month, hist=True, ax = axs[2])
Data Pollution / Noise:
Outlier Values:
Missing, Unknown / Unspecified Values:
Negative Values:
Zero Values:
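The treatment strategies listed above can be sketched as two small helpers: IQR-based capping for outliers, and treating the 'unknown' label as missing and imputing it with the column mode. This is a minimal illustration, not the notebook's exact pipeline; the column names `bbal` and `job` and the demo values are stand-ins borrowed from this dataset's schema.

```python
import pandas as pd

def cap_iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence values."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def replace_unknown_with_mode(s: pd.Series, token: str = 'unknown') -> pd.Series:
    """Treat the 'unknown' label as missing and impute with the most frequent value."""
    mode = s[s != token].mode().iloc[0]
    return s.replace(token, mode)

# Demo on a tiny frame with one extreme balance and one 'unknown' job
demo = pd.DataFrame({'bbal': [100, 200, 300, 400, 100000],
                     'job': ['admin', 'admin', 'unknown', 'services', 'admin']})
demo['bbal'] = cap_iqr_outliers(demo['bbal'])
demo['job'] = replace_unknown_with_mode(demo['job'])
print(demo)
```

Capping (rather than dropping) keeps the row count intact, which matters here because the class of interest is already rare.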
D1.1.c vvv Please provide comments in the Jupyter notebook regarding the steps you take and the insights drawn from the plots. Begin Below: vvv
### Insight Drawn From the foregoing Plots:
### Steps To Take: Encoding Strategy for Cat. & NonCat. Column Values:
### Steps To Take: In a Nut Shell:
D1.2.a Bi-variate analysis between the Predictor variables and Target column. Comment on your findings in terms of their relationship and degree of relation, if any. Visualize the analysis using boxplots and pair plots, histograms or density curves. Select the most appropriate attributes.
Markdown('### * Pair Plot Study of Predictor variables and Target column (Hue="Target") with Density Curves (diag_kind="kde")')
sns.pairplot(tdf, diag_kind='kde', hue='Target')
Markdown('### * Box Plot Study of Predictor variables and Target column (Hue="Target")')
fig, axs = plt.subplots(ncols = len(numcols), figsize = (30, 5))
j=0
for i in tdf[numcols]:
sns.boxplot(tdf[i], ax = axs[j], orient='v', hue=tdf.Target)
j = j+1
# Optional Exhibits:
tdf.corr()
sns.heatmap(tdf.corr())
D1.2.b Please provide comments in the Jupyter notebook regarding the steps you take and the insights drawn from the plots.
tdf.info()
tdf.head()
# Drop columns as Identified in earlier Deliverable D1.x.x Notes: 'talktm', 'pdays', 'ptalknm', 'presult'
tdf.head()
tdf.info() # Before drop
tdf.drop(columns=['talktm', 'pdays', 'ptalknm', 'presult'], inplace=True)
tdf.head()
tdf.info() # After drop
# Incremental DF Backup 3 as of: Fri.Jun.19 04:59pm
tdf3 = tdf.copy() # Modified Df: Dropped 4 Columns: 'talktm', 'pdays', 'ptalknm', 'presult'
# Verify backup copy
tdf3.shape, type(tdf3)
tdf3.sample(6)
tdf3.info()
# Prepare for Encoding Cat Cols as Discussed & Described in Previous Deliverables D1.x.x above
# Create main current DF "tdf" to a temp DF "xdf"
xdf = tdf.copy()
# Bin column 'age' and add the binned series to xdf as new col 'ageb' (age binned)
ageb = pd.cut(xdf.age, bins=10, labels=False, retbins=True) # Good!
xdf['ageb'] = ageb[0] # Get the Data Series Good!
ageb[1] # Display ageb Bins
# Bin column 'bbal' and add the binned series to xdf as new col 'balb' (bbal binned)
balb = pd.cut(xdf.bbal, bins=13, labels=False, retbins=True) # Good!
xdf['balb'] = balb[0] # Get the Data Series
balb[1] # Display balb Bins
# Bin column 'mntday' and add the binned series to xdf as new col 'mdayb' (mdayb binned)
mdayb = pd.cut(xdf.mntday, bins=3, labels=False, retbins=True) # Good!
xdf['mdayb'] = mdayb[0] # Get the Data Series
mdayb[1] # Display mdayb Bins
# Decode 'month' name text values to month numeric values
xdf.month = pd.to_datetime(xdf.month, format='%b').dt.month
# Bin column 'month' and add the binned series to xdf as new col 'mntb' (mntb binned)
mntb = pd.cut(xdf.month, bins=4, labels=False, retbins=True) # Good!
xdf['mntb'] = mntb[0] # Get the Data Series Good!
mntb[1] # Display mntb Bins
# Scale & Bin column 'talknm' and add the scaled & binned series to xdf as new col 'tnmb' (tnmb binned)
# Used "log" to scale the values, then applied "round" twice to derive 9 bins (1...9)
xdf['tnmb'] = ((round(round(np.log(xdf.talknm),1)/5,1)*10)+1) # *** Good ***
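The nested `round()` chain above is compact but hard to read. An alternative sketch of the same log-then-bin idea (not numerically identical to the formula above) is `pd.cut` on the log-scaled values; the sample series below is a hypothetical stand-in for `talknm`:

```python
import numpy as np
import pandas as pd

talknm = pd.Series([1, 2, 3, 5, 10, 20, 63])  # hypothetical contact counts
# Bin the log-scaled values into 9 equal-width slots labelled 1..9,
# similar in spirit to the round()-based formula used above
tnmb = pd.cut(np.log(talknm), bins=9, labels=range(1, 10))
print(tnmb)
```

`pd.cut` makes the bin edges explicit (via `retbins=True` if needed), which keeps the binning consistent with how `ageb`, `balb`, `mdayb` and `mntb` were derived earlier.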
xdf.sample(6)
xdf.info()
# Attach the newly Generated Binned / Scaled columns from Temp df ("xdf") to Main df ("tdf"): ageb, mntb, mdayb, balb, tnmb
tdf3.sample(6) # Before
tdf3.info()
# Attach ageb Column: Age: binned in 10 (roughly a decade) slots
tdf['ageb'] = ageb[0]
print('\n* ageb Bins:', ageb[1])
# Attach mntb Column: Month: binned to Year Quarter (roughly 3 months or a 1/4 year per the peaks in the plot)
tdf['mntb'] = mntb[0]
print('\n* mntb Bins:', mntb[1])
# Attach mdayb Column: Month Day: binned to a 1/3 of a month (roughly 10 days per the peaks in the plot)
tdf['mdayb'] = mdayb[0]
print('\n* mdayb Bins:', mdayb[1])
# Attach balb Column: Bank Balance: binned in 13 (optimal) slots per the "pd.cut()" function's internal algorithm
tdf['balb'] = balb[0]
print('\n* balb Bins:', balb[1])
# Attach tnmb Column: Number of Contacts (Current Campaign): Scaled with "log()" function then binned in 9 slots
tdf['tnmb'] = xdf['tnmb']
print('\n* tnmb Bins: 9 Bins (1...9)')
tdf.sample(6)
tdf.info() # After
# Incremental DF Backup 4 as of: Fri.Jun.19 08:15pm
tdf4 = tdf.copy() # Modified Df: Added 5 Binned Columns: ageb, mntb, mdayb, balb, tnmb
# Verify backup copy
tdf4.shape, type(tdf4)
tdf4.sample(6)
tdf4.info()
# After deriving 5 new binned cols from 5 old raw columns: Drop these Redundant raw cols: (age, bbal, month, mntday, talknm):
tdf.head() # Before drop
tdf.info()
tdf.drop(columns=['age', 'bbal', 'month', 'mntday', 'talknm'], inplace=True)
tdf.head()
tdf.info() # After drop
Ordinal Cat. Cols of type: "number" (numeric): ageb, mntb, mdayb, balb, tnmb
Column Data Type are Correct and OK to Proceed further processing!
# Incremental DF Backup 5 as of: Fri.Jun.19 09:03pm
tdf5 = tdf.copy() # Modified Df: Drop Redundant raw / old cols: (age, bbal, month, mntday, talknm) after new Binned cols
# Verify backup copy
tdf5.shape, type(tdf5)
tdf5.sample(6)
tdf5.info()
# One Hot Code all 8 Category Columns:
# Create dummies
hotcols = tdf.select_dtypes(include='category').columns
tdf.head() # Before
tdf.info()
tdf = pd.get_dummies(tdf, columns=hotcols, drop_first=True)
tdf.head()
tdf.info() # After
# Incremental DF Backup 6 as of: Fri.Jun.19 09:33pm
tdf6 = tdf.copy() # Modified Df: After One Hot Encode: Added 14 New cols
# Verify backup copy
tdf6.shape, type(tdf6)
tdf6.sample(6)
tdf6.info()
# Rename col names for Convenience:
cols = tdf.columns
cols
tdf
tdf.rename(columns={'job_blue-collar': 'jblue',
'job_entrepreneur': 'jentr', 'job_housemaid': 'jmaid', 'job_management': 'jmgmt', 'job_retired': 'jrtrd',
'job_self-employed': 'jself', 'job_services': 'jsvcs', 'job_student': 'jstdnt', 'job_technician': 'jtchn',
'job_unemployed': 'jnemp', 'job_unknown': 'jnknwn', 'marital_married': 'mmrd', 'marital_single': 'msngl',
'edu_secondary': 'esec', 'edu_tertiary': 'etrt', 'edu_unknown': 'enknwn', 'dflt_yes': 'dflt', 'hloan_yes': 'hloan',
'ploan_yes': 'ploan', 'comtyp_telephone': 'cphone', 'comtyp_unknown': 'cnknwn', 'Target_yes': 'Trgt'}, inplace=True, errors='raise')
cols
tdf
tdf.info()
# Incremental DF Backup 7 as of: Fri.Jun.19 10:44pm
tdf7 = tdf.copy() # Modified Df: After renaming long cols name to shorten for convenience
tdf.to_csv('AIML_Project3_tdf_7_MSB.csv') # Also export as to .csv to disk
# Verify backup copy
tdf7.shape, type(tdf7)
tdf7.sample(6)
tdf7.info()
# Create the training set and test set in ratio of 70:30
# Define X and Y variables:
X = tdf.drop('Trgt',axis=1) # Predictor feature columns
Y = tdf['Trgt'] # Predicted class (1 = True, 0 = False)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size= 0.3, random_state=3)
X_train
X_test
y_train
y_test
ttl = len(X), len(Y)
trn = len(X_train), len(y_train)
tst = len(X_test), len(y_test)
print('\n* Data split into 70:30 Ratio as Required: See Below:')
print('\n* TRAIN dataset %rows',round(trn[0]*100/ttl[0],2),'%')
print('* TEST dataset %rows',round(tst[0]*100/ttl[0],2),'%')
# Calc baseline proportion of the Target ("Trgt"): Class Imbalance: Ratio of Yes ("1") to No ("0")
Yp = tdf['Trgt'].value_counts(normalize=True)
print('\n* There is some Class (Trgt) Imbalance: 0 = No; 1 = Yes')
print(Yp)
# Calc & Compare Percentages % :
print('\n* Percentage of True/False values of Predicted Class Trgt :\n')
print("Original Trgt True Values : {0} ({1:0.2f}%)".format(len( tdf.loc[tdf['Trgt'] == 1]), (len(tdf.loc[tdf['Trgt'] == 1])/len(tdf.index)) * 100))
print("Original Trgt False Values : {0} ({1:0.2f}%)".format(len( tdf.loc[tdf['Trgt'] == 0]), (len(tdf.loc[tdf['Trgt'] == 0])/len(tdf.index)) * 100))
print("")
print("Training Trgt True Values : {0} ({1:0.2f}%)".format(len( y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training Trgt False Values : {0} ({1:0.2f}%)".format(len( y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Test Trgt True Values : {0} ({1:0.2f}%)".format(len( y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Test Trgt False Values : {0} ({1:0.2f}%)".format(len( y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("\n* Percentages seem uniform among Original, Train and Test datasets")
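The proportions above only happen to match because the split is random. Passing `stratify=Y` to `train_test_split` guarantees the class ratio is preserved in both splits, which is worth doing given the imbalance noted earlier. A self-contained sketch with synthetic stand-in data (the real notebook would pass its own `X`/`Y`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real X/Y, with roughly 12% positives
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list('abcd'))
Y = pd.Series((rng.random(1000) < 0.12).astype(int), name='Trgt')

# stratify=Y keeps the positive-class rate identical (to rounding) in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=3, stratify=Y)

print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```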
a. Create Models : First create models using Logistic Regression & Decision Tree algorithm.
b. Get Performance Metrics: Note the model performance by using different metrics.
c. Confusion Matrix : Use it to evaluate class level metrics i.e. Precision/Recall.
d. Metrics: Also reflect the Accuracy & F1 Score of the model.
# Build Logistic Regression Model:
# Fit the model on Train
model2 = LogisticRegression(solver="liblinear", penalty='l2', random_state=3)
model2.fit(X_train, y_train)
# Predict on test
y_predict = model2.predict(X_test)
coef_df = pd.DataFrame(model2.coef_)
coef_df['intercept'] = model2.intercept_
print(coef_df)
# Get the Accuracy (Score) of the Model against Test Data
accScore = model2.score(X_test, y_test)
print("Model2 Score = %f" %(accScore))
# Build the Confusion Matrix:
cm = metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
cm
df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
df_cm
plt.figure(figsize = (11,7))
sns.heatmap(df_cm, annot=True, fmt="d", square=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
# Use Penalty "l1" to improve the model:
# Build Logistic Regression Model:
# Fit the model on Train
model1 = LogisticRegression(solver="liblinear", penalty='l1', random_state=3)
model1.fit(X_train, y_train)
# Predict on test
y_predict = model1.predict(X_test)
coef_df = pd.DataFrame(model1.coef_)
coef_df['intercept'] = model1.intercept_
print(coef_df)
# Get the Accuracy (Score) of the Model against Test Data
accScore = model1.score(X_test, y_test)
print("Model1 Score = %f" %(accScore))
# Get Other Metrics:
df_pred = pd.DataFrame(y_predict)
print("Recall:",recall_score(y_test,df_pred))
print("Precision:",precision_score(y_test,df_pred))
print("F1 Score:",f1_score(y_test,df_pred))
print("Roc Auc Score:",roc_auc_score(y_test,df_pred))
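The same per-class Precision/Recall/F1 figures can be printed in one table with `classification_report` (imported at the top of this notebook but not used so far). A toy sketch with hypothetical labels:

```python
from sklearn.metrics import classification_report

# Hypothetical true labels vs. predictions, just to show the report layout
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
print(classification_report(y_true, y_pred, digits=3))
```

For this toy pair, class 1 has 3 true positives out of 4 actual positives (recall 0.75) and 4 positive predictions (precision 0.75), and the report shows both at a glance alongside support counts.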
# Generate AUC ROC curves
lg_roc_auc = roc_auc_score(y_test, model1.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, model1.predict_proba(X_test)[:,1])
lg_roc_auc
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % lg_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
# Model Improvement: Hyper Parameter Tuning:
# Get Params of logistic regression
model1.get_params()
# Loop thru various "solver" to check diff values
# Most solvers support only the 'l2' penalty; 'liblinear' works with both 'l1' and 'l2'
train_score=[]
test_score=[]
solver = ['newton-cg','lbfgs','liblinear','sag','saga']
for i in solver:
model = LogisticRegression(random_state=42, penalty='l2', C = 0.75, solver=i)
model_fit = model.fit(X_train, y_train)
y_predict = model.predict(X_test)
train_score.append(round(model.score(X_train, y_train),3))
test_score.append(round(model.score(X_test, y_test),3))
print(solver)
print(train_score)
print(test_score)
model = LogisticRegression(random_state=42, penalty='l1', solver='saga') # changing penalty to l1
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print("Training accuracy",model.score(X_train,y_train))
print("Testing accuracy",model.score(X_test, y_test))
model = LogisticRegression(random_state=42, penalty='l1', solver='liblinear') # changing penalty to l1
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print("Training accuracy",model.score(X_train,y_train))
print("Testing accuracy",model.score(X_test, y_test))
model = LogisticRegression(random_state=42, solver='liblinear', penalty='l1',class_weight='balanced') # changing class weight to balanced
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print("Training accuracy",model.score(X_train,y_train))
print("Testing accuracy",model.score(X_test, y_test))
# Loop to check diff Threshold values of 'C'
train_score=[]
test_score=[]
C = [0.01,0.1,0.25,0.5,0.75,1]
for i in C:
model = LogisticRegression(random_state=42, solver='liblinear', penalty='l1', class_weight='balanced', C=i) # changing values of C
model_fit=model.fit(X_train, y_train)
y_predict = model.predict(X_test)
train_score.append(round(model.score(X_train,y_train),3)) # appending training accuracy in a blank list for every run of the loop
test_score.append(round(model.score(X_test, y_test),3)) # appending testing accuracy in a blank list for every run of the loop
print(C)
print(train_score)
print(test_score)
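The manual loops over `solver` and `C` above score each candidate on the test set, which risks tuning to it. `GridSearchCV` does the same sweep but scores by cross-validation on the training data only. A sketch on synthetic stand-in data (the grid values mirror the loops above; the real notebook would pass its own `X_train`/`y_train`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in imbalanced data, roughly matching this dataset's ~88/12 split
X, y = make_classification(n_samples=600, weights=[0.88], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Same search space as the manual loops, scored by 5-fold cross-validated F1
grid = GridSearchCV(
    LogisticRegression(solver='liblinear', class_weight='balanced', random_state=42),
    param_grid={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 0.25, 0.5, 0.75, 1]},
    scoring='f1', cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.best_score_, 3))
```

The test set is then touched only once, to report the final score of `grid.best_estimator_`.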
# Hence the Final / Best model is:
model = LogisticRegression(random_state=42, solver='liblinear', penalty='l1', class_weight='balanced',C=0.25)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])
# cm
df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predict 1","Predict 0"]])
# df_cm
plt.figure(figsize = (11,7))
sns.heatmap(df_cm, annot=True, fmt="d", square=True)
plt.ylabel('Actual')
plt.xlabel('Predicted')
print('Here is The FINAL / BEST Model:')
print()
print("Training accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print("Recall:",recall_score(y_test,y_predict))
print()
print("Precision:",precision_score(y_test,y_predict))
print()
print("F1 Score:",f1_score(y_test,y_predict))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_predict))
print()
print('Confusion Matrix:')
# Build Decision Tree:
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)
# Score the DTree:
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
# Visualize DTree:
train_char_label = ['No', 'Yes']
TD_Tree_File = open('TD_tree.dot','w')
dot_data = tree.export_graphviz(dTree, out_file=TD_Tree_File, feature_names = list(X_train), class_names = list(train_char_label))
TD_Tree_File.close()
# Display DTree:
retCode = system("dot -Tpng TD_tree.dot -o TD_tree.png")
if(retCode>0):
print("system command returning error: "+str(retCode))
else:
display(Image("TD_tree.png"))
# Reducing overfitting by Regularization:
dTreeR = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1)
dTreeR.fit(X_train, y_train)
print(dTreeR.score(X_train, y_train))
print(dTreeR.score(X_test, y_test))
# Display DTree:
train_char_label = ['No', 'Yes']
TD_Tree_FileR = open('TD_treeR.dot','w')
dot_data = tree.export_graphviz(dTreeR, out_file=TD_Tree_FileR, feature_names = list(X_train), class_names = list(train_char_label))
TD_Tree_FileR.close()
retCode = system("dot -Tpng TD_treeR.dot -o TD_treeR.png")
if(retCode>0):
print("system command returning error: "+str(retCode))
else:
display(Image("TD_treeR.png"))
# Importance of features in the tree building: the importance of a feature is computed as the (normalized)
# total reduction of the criterion brought by that feature, known as the Gini importance:
print (pd.DataFrame(dTreeR.feature_importances_, columns = ["Imp"], index = X_train.columns))
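The importances printed above are in column order; sorting them makes the dominant splits readable at a glance. A self-contained sketch with a stand-in tree and synthetic features (the notebook would use its own `dTreeR` and `X_train.columns`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in for dTreeR / X_train
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=1)
cols = [f'f{i}' for i in range(6)]
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=1).fit(X, y)

# Gini importances sum to 1; sort so the strongest features come first
imp = pd.Series(clf.feature_importances_, index=cols).sort_values(ascending=False)
print(imp)
```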
print(dTreeR.score(X_test , y_test))
y_predict = dTreeR.predict(X_test)
y_predict_dt = y_predict
cm=metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
# Build Bagging Learning Model:
bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=50,random_state=1)
bgcl = bgcl.fit(X_train, y_train)
y_predict = bgcl.predict(X_test)
y_predict_bg = y_predict
print(bgcl.score(X_test , y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
# Build AdaBooster Learning Model:
abcl = AdaBoostClassifier(n_estimators=10, random_state=1)
abcl = abcl.fit(X_train, y_train)
y_predict = abcl.predict(X_test)
y_predict_ab = y_predict
print(abcl.score(X_test , y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
# Gradient Boost Learning Model:
gbcl = GradientBoostingClassifier(n_estimators = 50,random_state=1)
gbcl = gbcl.fit(X_train, y_train)
y_predict = gbcl.predict(X_test)
y_predict_gb = y_predict
print(gbcl.score(X_test, y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
# Random Forest Classifier Learning Model:
rfcl = RandomForestClassifier(n_estimators = 50, random_state=1,max_features=12)
rfcl = rfcl.fit(X_train, y_train)
y_predict = rfcl.predict(X_test)
y_predict_rf = y_predict
print(rfcl.score(X_test, y_test))
cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])
df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
y_predict_dt[:20]
y_predict_bg[:20]
y_predict_ab[:20]
y_predict_gb[:20]
y_predict_rf[:20]
# Combine all the results into a single dataframe ( this is just to aid visualizing the results )
pred_DT = y_predict_dt.reshape(-1,1)
pred_RF = y_predict_rf.reshape(-1,1)
pred_AB = y_predict_ab.reshape(-1,1)
pred_BG = y_predict_bg.reshape(-1,1)
pred_GB = y_predict_gb.reshape(-1,1)
k = np.concatenate((pred_DT,pred_RF,pred_AB,pred_BG,pred_GB),axis= 1)
df_k = pd.DataFrame(k,columns= ['DT','RF','AB','BG','GB'])
df_k.head(10)
# Accuracy Scores for different algorithms:
accuracy_score(y_test, df_k['DT'])
accuracy_score(y_test, df_k['RF'])
accuracy_score(y_test, df_k['AB'])
accuracy_score(y_test, df_k['BG'])
accuracy_score(y_test, df_k['GB'])
# Take either the mean or median OR your custom weights:
df_k['mean'] =df_k.iloc[:,0:5].mean(axis=1)
df_k['median'] =df_k.iloc[:,0:5].median(axis=1)
df_k['mode'] =df_k.iloc[:,0:5].mode(axis=1).iloc[:,0]
# Check Final Accuracy Scores for either of these combinations:
accuracy_score(y_test, df_k['mean'].round(decimals=0))
accuracy_score(y_test, df_k['median'])
accuracy_score(y_test, df_k['mode'])
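The deliverable also asks for a single DataFrame comparing the models and their metrics. A sketch of that comparison table, using synthetic stand-in data and freshly fitted classifiers (the notebook would instead reuse its own fitted `dTreeR`, `bgcl`, `abcl`, `gbcl`, `rfcl` and the real test split):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in imbalanced data in place of the notebook's X/Y
X, y = make_classification(n_samples=600, weights=[0.88], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {'DT': DecisionTreeClassifier(max_depth=3, random_state=1),
          'BG': BaggingClassifier(n_estimators=50, random_state=1),
          'GB': GradientBoostingClassifier(n_estimators=50, random_state=1),
          'RF': RandomForestClassifier(n_estimators=50, random_state=1)}

# One row of test-set metrics per model
rows = []
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    rows.append({'model': name,
                 'accuracy': accuracy_score(y_te, pred),
                 'precision': precision_score(y_te, pred, zero_division=0),
                 'recall': recall_score(y_te, pred, zero_division=0),
                 'f1': f1_score(y_te, pred, zero_division=0)})
cmp_df = pd.DataFrame(rows).set_index('model').round(3)
print(cmp_df)
```

Given the class imbalance, ranking by F1 (or recall on the positive class) is usually more informative here than ranking by raw accuracy.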